Introduction

Klepousniotou et al. (2008) and Brown (2008) both showed that the ease of transitioning between two senses of a polysemous word depends on their degree of overlap. That is, there is less inhibition (greater facilitation) between marinated lamb and baby lamb than between control panel and advisory panel. As a baseline, they also measured priming within a sense category, e.g., between marinated lamb and tender lamb. Klepousniotou et al. (2008) additionally analyzed an interaction with sense dominance, but I will not be analyzing that here.

In the current analyses, I adapted most of their stimuli to sentences, e.g., "He liked the marinated lamb" vs. "He liked the baby lamb". I also created new stimuli using homonyms from Klepousniotou & Baum (2007).

I then ran each sentence through ELMo and BERT and obtained contextualized embeddings for the target word, e.g., lamb. Finally, for each pair of sentences, I computed the cosine distance between the two embeddings of the target word. This allows us to make several comparisons:

First, is cosine distance consistently larger for usages occurring across senses? That is, is \(d_{cos}(lamb_{marinated}, lamb_{baby}) > d_{cos}(lamb_{marinated}, lamb_{tender})\)? If so, this suggests that the contextualized embeddings distinguish between contexts of use in a way that correlates with sense boundaries.

Second: for usages occurring across senses (e.g., baby/marinated lamb), does cosine distance vary as a function of ambiguity type (e.g., polysemy vs. homonymy)?
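As a quick sanity check on the measure itself, cosine distance can be sketched in a few lines of R; the vectors below are toy stand-ins for the actual ELMo/BERT embeddings, not real model output:

```r
# Minimal sketch: cosine distance = 1 - cosine similarity.
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

cosine_distance(c(1, 0), c(1, 0))  # identical vectors: distance 0
cosine_distance(c(1, 0), c(0, 1))  # orthogonal vectors: distance 1
```

Two contextualized embeddings of the same wordform thus get a distance near 0 when the contexts push them in similar directions, and a larger distance when they diverge.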

Our data thus look as follows: each observation represents a comparison (cosine distance) between two identical wordforms appearing in different contexts. One third of these observations reflect same-sense contexts; two thirds reflect different-sense contexts. Furthermore, condition reflects the degree of overlap between the different-sense contexts.
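The pairwise structure can be illustrated with a toy version of a single item, assuming two contexts per sense ("wooly" is a hypothetical fourth context, for illustration only):

```r
# Sketch of the pairwise design for one word (e.g., lamb), two contexts per sense.
contexts <- data.frame(
  context = c("marinated", "tender", "baby", "wooly"),
  sense   = c("A", "A", "B", "B")
)

# All unordered pairs of contexts: choose(4, 2) = 6 comparisons per word.
pairs <- t(combn(nrow(contexts), 2))
comparisons <- data.frame(
  context_1 = contexts$context[pairs[, 1]],
  context_2 = contexts$context[pairs[, 2]],
  same      = contexts$sense[pairs[, 1]] == contexts$sense[pairs[, 2]]
)

table(comparisons$same)  # 2 same-sense pairs, 4 different-sense pairs
```

This is where the 1/3 vs. 2/3 split comes from: of the six pairs per word, two compare contexts within a sense and four compare contexts across senses.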

Random factors include:

  • Word (the ambiguous wordform)
  • Model (BERT vs. ELMo)

For word, we include by-word random slopes for the effect of same sense.

Load data

library(tidyverse)  # read_csv, dplyr, ggplot2
library(lme4)       # lmer
library(ggridges)   # geom_density_ridges2
df_distances = read_csv("../../data/processed/stims_with_nlm_distances.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   Class = col_character(),
##   ambiguity_type = col_character(),
##   ambiguity_type_mw = col_character(),
##   ambiguity_type_oed = col_character(),
##   different_frame = col_character(),
##   distance_bert = col_double(),
##   distance_elmo = col_double(),
##   overlap = col_character(),
##   same = col_logical(),
##   source = col_character(),
##   string = col_character(),
##   version = col_character(),
##   word = col_character()
## )
nrow(df_distances)
## [1] 690

We leave out items for which it was unclear whether the different versions were truly different senses. TODO: Discuss this approach with Ben.

df_distances = df_distances %>%
  filter(ambiguity_type != "Unsure")
nrow(df_distances)
## [1] 672
length(unique(df_distances$word))
## [1] 112
table(df_distances$same)
## 
## FALSE  TRUE 
##   448   224
table(df_distances$same) / nrow(df_distances)
## 
##     FALSE      TRUE 
## 0.6666667 0.3333333
table(df_distances$ambiguity_type) / 6  # each word contributes 6 pairwise comparisons
## 
## Homonymy Polysemy 
##       38       74
table(df_distances$ambiguity_type, df_distances$Class) / 6
##           
##             N  V
##   Homonymy 31  7
##   Polysemy 53 21

Primary analyses

Comparing BERT and ELMo distances

df_distances %>%
  ggplot(aes(x = distance_elmo,
             y = distance_bert,
             color = same)) +
  geom_point(alpha = .5) +
  theme_minimal() +
  geom_smooth(method = "lm") +
  facet_grid(~ambiguity_type)
## `geom_smooth()` using formula 'y ~ x'

cor.test(df_distances$distance_bert,
         df_distances$distance_elmo,
         method = 'spearman')
## 
##  Spearman's rank correlation rho
## 
## data:  df_distances$distance_bert and df_distances$distance_elmo
## S = 23277112, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.5397715

Is cosine distance larger for usages across senses?

First, we ask whether the existence of a sense boundary explains significant variance in the cosine distance between two usages of the same wordform.

In this analysis, we also add a random intercept for the model (BERT vs. ELMo) used to compute cosine distance.

df_distances_reshaped = df_distances %>%
  mutate(elmo = distance_elmo,
         bert = distance_bert) %>%
  pivot_longer(c(elmo, bert), names_to = "model",
               values_to = "distance")

model_same = lmer(data = df_distances_reshaped,
                  distance ~ same + 
                    Class + 
                    (1 | model) +
                    (1 + same | word),
                  control=lmerControl(optimizer="bobyqa"),
                  REML=FALSE)

model_reduced = lmer(data = df_distances_reshaped,
                  distance ~
                    Class + 
                    (1 | model) +
                    (1 + same | word),
                  control=lmerControl(optimizer="bobyqa"),
                  REML=FALSE)

summary(model_same)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: distance ~ same + Class + (1 | model) + (1 + same | word)
##    Data: df_distances_reshaped
## Control: lmerControl(optimizer = "bobyqa")
## 
##      AIC      BIC   logLik deviance df.resid 
##  -2635.6  -2593.9   1325.8  -2651.6     1336 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.6646 -0.5966 -0.0210  0.4393  5.1259 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  word     (Intercept) 0.002891 0.05377       
##           sameTRUE    0.001107 0.03328  -0.99
##  model    (Intercept) 0.007822 0.08844       
##  Residual             0.007112 0.08433       
## Number of obs: 1344, groups:  word, 112; model, 2
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept)  0.213266   0.062853   3.393
## sameTRUE    -0.099220   0.005805 -17.091
## ClassV      -0.004643   0.009522  -0.488
## 
## Correlation of Fixed Effects:
##          (Intr) smTRUE
## sameTRUE -0.065       
## ClassV   -0.038  0.000
anova(model_same, model_reduced)
## Data: df_distances_reshaped
## Models:
## model_reduced: distance ~ Class + (1 | model) + (1 + same | word)
## model_same: distance ~ same + Class + (1 | model) + (1 + same | word)
##               npar     AIC     BIC logLik deviance  Chisq Df Pr(>Chisq)    
## model_reduced    7 -2493.8 -2457.4 1253.9  -2507.8                         
## model_same       8 -2635.6 -2593.9 1325.8  -2651.6 143.72  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We find that it does. We can illustrate this visually as well:

df_distances_reshaped %>%
  ggplot(aes(x = distance,
             fill = same,
             y = model)) +
  geom_density_ridges2(alpha = .6) +
  theme_minimal() +
  labs(x = "Cosine Distance",
       y = "Model") +
  facet_wrap(~ambiguity_type)+
  theme(axis.title = element_text(size=rel(2)),
        axis.text = element_text(size = rel(2)),
        legend.text = element_text(size = rel(2)),
        legend.title = element_text(size = rel(2)),
        strip.text.x = element_text(size = rel(2)))
## Picking joint bandwidth of 0.0213
## Picking joint bandwidth of 0.0208

ggsave("../../Figures/cosine_distances.pdf", dpi = 300)
## Saving 7 x 5 in image
## Picking joint bandwidth of 0.0213
## Picking joint bandwidth of 0.0208

Does cosine distance vary as a function of the type of ambiguity?

Above, we saw that the cosine distance between two usages varied as a function of whether those usages belonged to the same sense.

We also show that a model with both condition (ambiguity type) and same does not explain more variance than a model with only same; we do not necessarily expect it to, given that same-sense pairs are included here.

model_both = lmer(data = df_distances_reshaped,
                  distance ~ same + 
                    Class + 
                    ambiguity_type +
                    (1 | model) +
                    (1 + same | word),
                  control=lmerControl(optimizer="bobyqa"),
                  REML=FALSE)
## boundary (singular) fit: see ?isSingular
anova(model_both, model_same)
## Data: df_distances_reshaped
## Models:
## model_same: distance ~ same + Class + (1 | model) + (1 + same | word)
## model_both: distance ~ same + Class + ambiguity_type + (1 | model) + (1 + 
## model_both:     same | word)
##            npar     AIC     BIC logLik deviance  Chisq Df Pr(>Chisq)
## model_same    8 -2635.6 -2593.9 1325.8  -2651.6                     
## model_both    9 -2634.7 -2587.8 1326.3  -2652.7 1.0846  1     0.2977

There is also no significant interaction between condition and same, though the estimate is numerically in the predicted direction:

model_interaction = lmer(data = df_distances_reshaped,
                     distance ~ ambiguity_type * same + 
                       Class +
                    (1 | model) +
                    (1 + same | word),
                     control=lmerControl(optimizer="bobyqa"),
                     REML=FALSE)
## boundary (singular) fit: see ?isSingular
anova(model_both, model_interaction)
## Data: df_distances_reshaped
## Models:
## model_both: distance ~ same + Class + ambiguity_type + (1 | model) + (1 + 
## model_both:     same | word)
## model_interaction: distance ~ ambiguity_type * same + Class + (1 | model) + (1 + 
## model_interaction:     same | word)
##                   npar     AIC     BIC logLik deviance  Chisq Df Pr(>Chisq)
## model_both           9 -2634.7 -2587.8 1326.3  -2652.7                     
## model_interaction   10 -2634.8 -2582.8 1327.4  -2654.8 2.1907  1     0.1388
summary(model_interaction)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: distance ~ ambiguity_type * same + Class + (1 | model) + (1 +  
##     same | word)
##    Data: df_distances_reshaped
## Control: lmerControl(optimizer = "bobyqa")
## 
##      AIC      BIC   logLik deviance df.resid 
##  -2634.8  -2582.8   1327.4  -2654.8     1334 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.7451 -0.6002 -0.0174  0.4487  5.0784 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  word     (Intercept) 0.002893 0.05379       
##           sameTRUE    0.001078 0.03284  -1.00
##  model    (Intercept) 0.007822 0.08844       
##  Residual             0.007100 0.08426       
## Number of obs: 1344, groups:  word, 112; model, 2
## 
## Fixed effects:
##                                  Estimate Std. Error t value
## (Intercept)                      0.216031   0.063353   3.410
## ambiguity_typePolysemy          -0.003760   0.012308  -0.305
## sameTRUE                        -0.111194   0.009922 -11.207
## ClassV                          -0.005768   0.009564  -0.603
## ambiguity_typePolysemy:sameTRUE  0.018124   0.012207   1.485
## 
## Correlation of Fixed Effects:
##             (Intr) ambg_P smTRUE ClassV
## ambgty_typP -0.125                     
## sameTRUE    -0.111  0.572              
## ClassV      -0.028 -0.077  0.000       
## ambg_P:TRUE  0.090 -0.704 -0.813  0.000
## convergence code: 0
## boundary (singular) fit: see ?isSingular

We can also compare the model's predictions against the real values for cosine distance.

df_distances_reshaped$predictions = predict(model_interaction)

df_distances_reshaped %>%
  ggplot(aes(x = predictions,
             y = distance,
             color = same,
             shape = ambiguity_type)) +
  geom_point(alpha = .4) +
  facet_grid(~model) +
  theme_minimal()

Conclusion

Both BERT and ELMo appear to capture whether two usages of a wordform correspond to same or different senses (as determined by Merriam-Webster/OED): Cosine Distance is larger for different-sense than for same-sense usages.

The relationship between Cosine Distance and Ambiguity Type (i.e., Homonymy vs. Polysemy) is less clear, however: the Ambiguity Type × Same Sense interaction is in the predicted direction but not significant, so Cosine Distance does not appear to be reliably larger for different-sense homonyms than for different-sense polysemes.

In future work, we will compare each measure---Cosine Distance, as well as Same Sense and Ambiguity Type---to human judgments of relatedness. Ultimately, each measure will be used to predict human behavior (Accuracy and RT) on a primed sensibility judgment task.